This comprehensive guide demonstrates two-dimensional functional clustering analysis using the MultiConnector package. Unlike one-dimensional analysis that focuses on a single measurement type, two-dimensional analysis simultaneously models multiple related measurements, providing richer insights into complex biological or clinical patterns.
We use the MCL (Multiple Correspondence Longitudinal) dataset as our case study, which contains two simultaneous measurements (PB and BM) tracked over time. This approach is particularly valuable for:
Traditional clustering often treats each measurement independently, potentially missing important relationships between variables. Two-dimensional functional clustering:
The MCL dataset contains:
# Load required packages
library(MultiConnector)
library(tibble)
library(dplyr)
library(ggplot2)
# Load the MCL dataset
# This dataset contains two-dimensional time series data
TimeSeries <- readRDS(system.file("Data/MCL/TimeSeries.rds", package="MultiConnector"))
Annotations <- readRDS(system.file("Data/MCL/Annotations.rds", package="MultiConnector"))
# Display basic dataset information
cat("Dataset Overview:")#> Dataset Overview:
#>
#> - Time series data dimensions: 1341 5
#>
#> - Annotations dimensions: 117 6
#>
#> - Unique subjects: 117
#>
#> - Measurement types: PB, BM
#>
#> - Time range: 62 1666
The first step is creating a CONNECTORData object that
properly structures our two-dimensional time series data.
# Create the main data object for two-dimensional analysis
Data <- ConnectorData(tibble(TimeSeries), tibble(Annotations))#> ###############################
#> Data loaded...
#> Number of Observations:380
#> Number of distinct measures:2
#> Average length:6.647368
#> CONNECTORData object with:
#> - Curves: 1263
#> - Subjects: 95
#> - Measures: 2
#> CONNECTORData object with:
#> - Subjects: 95
#> - Measures: 2
#>
#> Lengths summary:
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 3 5 7 6.647368 8 9
Comprehensive exploration of the two-dimensional time series structure and patterns.
This plot shows the raw time series for both PB and BM measurements, providing an initial view of the data structure and variability patterns.
# Examine available features
available_features <- getAnnotations(Data)
cat("Available features for analysis:")#> Available features for analysis:
#> [1] "subjID" "Arm" "TTP" "TTP_date" "TTP_dateReg"
#> [6] "Dynamic"
# Plot 2.2: Time series colored by key features
# This helps identify potential relationships between features and temporal patterns
# TTP (Time To Progression) analysis
plot(Data, feature="TTP")These visualizations reveal how different clinical features (TTP, treatment arms) may relate to the temporal patterns in our two measurements.
# Plot 2.3: Detailed time distribution analysis
# This is crucial for two-dimensional data to understand sampling patterns
# Comprehensive time analysis
plotTimes(Data, large=TRUE)The time distribution analysis is particularly important for two-dimensional data as it reveals:
For two-dimensional analysis, we need to determine optimal spline dimensions for both measurements simultaneously.
# Cross-validation for optimal spline basis dimensions
# This step is crucial as each measurement type may require different flexibility
cat("Estimating optimal spline dimensions...")#> Estimating optimal spline dimensions...
#>
#> This process may take several minutes for two-dimensional data...
# Estimate dimensions for both measurements
# Testing p values from 2 to 6 for both PB and BM
CrossLogLikePlot <- estimatepDimension(Data, p=2:6, cores=5)#> [1] "Total time for PB : 46.96 secs"
#> [1] "Total time for BM : 46.67 secs"
#> [1] "Total time: 1.56 mins"
#> $PB
#>
#> $BM
The dimension estimation plot shows cross-validation results for both measurements:
# Based on the cross-validation results, select optimal dimensions
# These values would typically be chosen based on the plotted results
optimal_p <- c("PB" = 4, "BM" = 4)
cat("Selected optimal dimensions:")#> Selected optimal dimensions:
#>
#> - PB measurement: p = 4
#>
#> - BM measurement: p = 4
#>
#>
#> Dimension selection rationale:
#>
#> - Higher p: More flexible curves, captures fine details
#>
#> - Lower p: Smoother curves, reduces overfitting
#>
#> - Optimal p: Balance between flexibility and generalization
The core clustering step that simultaneously models both measurements to identify coordinated patterns.
#> Starting two-dimensional clustering analysis...
#>
#> Estimated computation time: ~5-10 minutes with 5 cores
#>
#> (Time increases significantly with number of subjects and measurements)
# Perform clustering with multiple cluster numbers
# This is computationally intensive for two-dimensional data
clusters <- estimateCluster(Data,
G = 2:6, # Test 2-6 clusters
p = optimal_p, # Use optimal spline dimensions for both measurements
runs = 20, # Multiple runs for stability
cores = 5) # Parallel processing#> [1] "Total time: 4.88 mins"
#>
#> Clustering analysis completed!
The clustering quality plot shows:
Key interpretations: - Optimal G: Look for elbow points or clear minima in fDB - Stability: Consistent results across runs indicate robust solutions - Computational scaling: Time increases with cluster number and data complexity
Select the optimal clustering configuration based on quality metrics.
# Select optimal configuration based on quality metrics
# For demonstration, using G=3 clusters with MinfDB criterion
optimal_G <- 3
selection_criterion <- "MinfDB"
cat("Cluster selection:")#> Cluster selection:
#>
#> - Number of clusters (G): 3
#>
#> - Selection criterion: MinfDB
#>
#> - Rationale: Minimum functional Davies-Bouldin index
# Create the final clustered data object
ClusterData <- selectCluster(clusters, G=optimal_G, selection_criterion)
cat("\n\nFinal clustering solution:")#>
#>
#> Final clustering solution:
#>
#> - Clusters identified: 3
#>
#> - Cluster labels: A, B, C
Comprehensive visualization of clustering results across both measurements.
This plot reveals:
# Plot 6.2: Clusters colored by clinical features
# This reveals associations between clusters and clinical characteristics
# TTP (Time To Progression) patterns
plot(ClusterData, feature="TTP")These visualizations help identify:
# Analyze cluster composition across features
available_features <- getAnnotations(ClusterData)
cat("Features available for composition analysis:")#> Features available for composition analysis:
#> [1] "Arm" "TTP" "TTP_date" "TTP_dateReg" "Dynamic"
#>
#> === CLUSTER COMPOSITION ANALYSIS ===
# TTP distribution across clusters
if ("TTP" %in% available_features) {
cat("\n\n1. TTP Distribution Across Clusters:")
ttp_dist <- clusterDistribution(ClusterData, feature="TTP")
print(ttp_dist)
}#>
#>
#> 1. TTP Distribution Across Clusters:# A tibble: 3 × 5
#> TTP `Cluster 1` `Cluster 2` `Cluster 3` Total
#> <chr> <int> <int> <int> <dbl>
#> 1 0 0 2 55 57
#> 2 1 5 13 20 38
#> 3 TOTAL 5 15 75 95
# Treatment arm distribution
if ("Arm" %in% available_features) {
cat("\n\n2. Treatment Arm Distribution:")
arm_dist <- clusterDistribution(ClusterData, feature="Arm")
print(arm_dist)
}#>
#>
#> 2. Treatment Arm Distribution:# A tibble: 4 × 5
#> Arm `Cluster 1` `Cluster 2` `Cluster 3` Total
#> <chr> <int> <int> <int> <dbl>
#> 1 0 1 10 39 50
#> 2 1 3 2 34 39
#> 3 NULL 1 3 2 6
#> 4 TOTAL 5 15 75 95
# Display cluster assignments
cluster_assignments <- getClusters(ClusterData)
cat("\n\n3. Cluster Assignment Summary:")#>
#>
#> 3. Cluster Assignment Summary:
#>
#> - Total subjects clustered: 95
cluster_summary <- table(cluster_assignments$Cluster)
for (i in names(cluster_summary)) {
cat("\n- Cluster", i, ":", cluster_summary[i], "subjects")
}#>
#> - Cluster 1 : 5 subjects
#> - Cluster 2 : 15 subjects
#> - Cluster 3 : 75 subjects
Detailed analysis of individual subjects within the two-dimensional clustering context.
# Select a representative subject for detailed analysis
example_subjects <- unique(cluster_assignments$subjID)[1:3]
selected_subject <- example_subjects[1]
cat("Detailed analysis for subject:", selected_subject)#> Detailed analysis for subject: Subject 1105
# Comprehensive subject information
subject_info <- SubjectInfo(ClusterData, subjIDs = selected_subject)
cat("\nSubject cluster assignment:")#>
#> Subject cluster assignment:
#>
#> Subject Subject 1105 belongs to Cluster 3
# Compare multiple subjects across clusters
multi_subject_info <- SubjectInfo(ClusterData, subjIDs = example_subjects)
cat("Multi-subject comparison:")#> Multi-subject comparison:
#>
#> Subjects analyzed: Subject 1105, Subject 1205, Subject 1206
# Show cluster assignments
for (i in 1:nrow(multi_subject_info$cluster_table)) {
cat("\n- Subject", multi_subject_info$cluster_table$subjID[i],
"-> Cluster", multi_subject_info$cluster_table$cluster[i])
}#>
#> - Subject Subject 1105 -> Cluster 3
#> - Subject Subject 1205 -> Cluster 3
#> - Subject Subject 1206 -> Cluster 2
Comprehensive validation of the two-dimensional clustering solution.
#> Performing cluster validation...
validation_metrics <- validateCluster(ClusterData)
# Display validation plot
validation_metrics$plot#>
#>
#> Validation Metrics Interpretation:
#>
#> - Silhouette Analysis: Measures how well subjects fit their assigned clusters
#>
#> * Values near +1: Well-clustered subjects
#>
#> * Values near 0: Borderline cases
#>
#> * Negative values: Potentially misclassified subjects
#>
#> - Entropy Analysis: Measures uncertainty in cluster assignments
#>
#> * Lower entropy: More confident assignments
#>
#> * Higher entropy: More uncertain assignments
# Extract and summarize validation metrics
if (!is.null(validation_metrics$metrics)) {
cat("\n\nValidation Summary:")
print(validation_metrics$metrics)
}
# Additional quality assessment
quality_metrics <- ClusterData@TTandfDBandSil
cat("\n\nClustering Quality Metrics:")#>
#>
#> Clustering Quality Metrics:
#> # A tibble: 1 × 4
#> fDB TT G Sil
#> <dbl> <dbl> <int> <dbl>
#> 1 0.788 3815. 3 0.671
Specialized visualizations for understanding two-dimensional clustering patterns.
# Discriminant analysis visualization
# This projects the high-dimensional functional data into lower-dimensional space
cat("Generating discriminant analysis plots...")#> Generating discriminant analysis plots...
# Feature-enhanced discriminant plot
if ("TTP" %in% available_features) {
discr_feature <- DiscriminantPlot(ClusterData, feature = "TTP")
discr_feature$ColFeature
}The discriminant plots show:
# Maximum discrimination function analysis
# Identifies the most discriminative aspects of the functional data
cat("Analyzing maximum discrimination functions...")#> Analyzing maximum discrimination functions...
#> $DiscrFunctionsPlot
#>
#> $Separated
This analysis reveals:
Advanced analysis involving subclustering of identified clusters.
# Extract subjects from a specific cluster for subclustering
cat("Performing subclustering analysis on Cluster 3...")#> Performing subclustering analysis on Cluster 3...
# Get cluster assignments
cluster_assignments <- getClusters(ClusterData)
# Select subjects from cluster 3
cluster3_subjects <- cluster_assignments %>%
filter(Cluster == 3) %>%
pull(subjID)
cat("\n- Subjects in Cluster 3:", length(cluster3_subjects))#>
#> - Subjects in Cluster 3: 75
#>
#> - First few subjects: Subject 1105, Subject 1205, Subject 1301, Subject 1304, Subject 1308, Subject 1309
# Create subset data for subclustering
subData <- ConnectorData(
tibble(TimeSeries) %>% filter(subjID %in% cluster3_subjects),
tibble(Annotations) %>% filter(subjID %in% cluster3_subjects)
)#> ###############################
#> Data loaded...
#> Number of Observations:300
#> Number of distinct measures:2
#> Average length:7.046667
#>
#>
#> Subset data for subclustering:
#> CONNECTORData object with:
#> - Subjects: 75
#> - Measures: 2
#>
#> Lengths summary:
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 3 6 7 7.046667 9 9
#> Performing subclustering analysis...
#>
#> Estimated time: ~5-10 minutes
# Subcluster analysis within Cluster 3
subClusters <- estimateCluster(subData,
G = 2:4, # Test fewer clusters for subset
p = optimal_p, # Use same spline dimensions
runs = 20, # Multiple runs for stability
cores = 5)#> [1] "Total time: 1.39 mins"
# Select optimal subclustering
subClusterData <- selectCluster(subClusters, G=2, "MinfDB")
cat("\nSubclustering completed!")#>
#> Subclustering completed!
#> Subclustering results:
# Subcluster composition
if ("TTP" %in% getAnnotations(subClusterData)) {
subcluster_dist <- clusterDistribution(subClusterData, feature="TTP")
cat("\n\nSubcluster composition (TTP):")
print(subcluster_dist)
}#>
#>
#> Subcluster composition (TTP):# A tibble: 3 × 4
#> TTP `Cluster 1` `Cluster 2` Total
#> <chr> <int> <int> <dbl>
#> 1 0 49 6 55
#> 2 1 10 10 20
#> 3 TOTAL 59 16 75
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-apple-darwin23.4.0
#> Running under: macOS 15.5
#>
#> Matrix products: default
#> BLAS: /usr/local/Cellar/openblas/0.3.28/lib/libopenblasp-r0.3.28.dylib
#> LAPACK: /usr/local/Cellar/r/4.4.1/lib/R/lib/libRlapack.dylib; LAPACK version 3.12.0
#>
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: Europe/Rome
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] ggplot2_3.5.2 dplyr_1.1.4 tibble_3.3.0 MultiConnector_1.0
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.2.1 viridisLite_0.4.2 farver_2.1.2
#> [4] viridis_0.6.5 fastmap_1.2.0 lazyeval_0.2.2
#> [7] promises_1.3.3 digest_0.6.37 mime_0.13
#> [10] lifecycle_1.0.4 ellipsis_0.3.2 statmod_1.5.0
#> [13] magrittr_2.0.3 compiler_4.4.1 sass_0.4.10
#> [16] rlang_1.1.6 tools_4.4.1 utf8_1.2.6
#> [19] yaml_2.3.10 geometry_0.5.2 data.table_1.16.4
#> [22] knitr_1.50 labeling_0.4.3 htmlwidgets_1.6.4
#> [25] pkgbuild_1.4.8 xml2_1.4.0 RColorBrewer_1.1-3
#> [28] pkgload_1.4.0 abind_1.4-8 miniUI_0.1.2
#> [31] withr_3.0.2 purrr_1.1.0 desc_1.4.3
#> [34] grid_4.4.1 roxygen2_7.3.2 urlchecker_1.0.1
#> [37] profvis_0.4.0 xtable_1.8-4 scales_1.4.0
#> [40] MASS_7.3-65 isoband_0.2.7 dichromat_2.0-0.1
#> [43] cli_3.6.5 rmarkdown_2.29 generics_0.1.4
#> [46] remotes_2.5.0 rlist_0.4.6.2 rstudioapi_0.17.1
#> [49] httr_1.4.7 tzdb_0.5.0 magic_1.6-1
#> [52] commonmark_2.0.0 sessioninfo_1.2.3 readxl_1.4.5
#> [55] cachem_1.1.0 stringr_1.5.1 splines_4.4.1
#> [58] parallel_4.4.1 cellranger_1.1.0 vctrs_0.6.5
#> [61] devtools_2.4.5 Matrix_1.7-3 jsonlite_2.0.0
#> [64] hms_1.1.3 patchwork_1.3.2 plotly_4.11.0
#> [67] tidyr_1.3.1 jquerylib_0.1.4 glue_1.8.0
#> [70] codetools_0.2-20 stringi_1.8.7 gtable_0.3.6
#> [73] later_1.4.4 pillar_1.11.0 htmltools_0.5.8.1
#> [76] R6_2.6.1 gghalves_0.1.4 rprojroot_2.1.1
#> [79] evaluate_1.0.4 shiny_1.11.1 lattice_0.22-7
#> [82] MetBrewer_0.2.0 readr_2.1.5 RhpcBLASctl_0.23-42
#> [85] memoise_2.0.1 bslib_0.9.0 httpuv_1.6.16
#> [88] Rcpp_1.1.0 gridExtra_2.3 xfun_0.53
#> [91] fs_1.6.6 usethis_3.1.0 pkgconfig_2.0.3
For two-dimensional functional clustering, the data must contain:
The two-dimensional clustering approach:
This comprehensive guide demonstrates the power of two-dimensional functional clustering for complex longitudinal data analysis. The MultiConnector package provides robust tools for identifying coordinated patterns across multiple measurements, enabling deeper insights into biological and clinical processes.